Measuring the Structural Similarity of Semistructured Documents Using Entropy
Abstract
We propose a technique for measuring the structural similarity of semistructured documents based on entropy. After extracting the structural information from two documents, we use either Ziv-Lempel encoding or Ziv-Merhav crossparsing to determine the entropy and, consequently, the similarity between the documents. To the best of our knowledge, this is the first true linear-time approach for evaluating structural similarity. In an experimental evaluation we demonstrate that the clustering quality achieved by our algorithm is on a par with or even better than that of existing approaches.
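The two steps named in the abstract (extract the structure, then parse it) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `linearize`, `lz78_phrases`, and `cross_parse_phrases` are hypothetical, the tag-event linearization is an assumption, and the cross-parsing here uses a naive quadratic substring search rather than a linear-time algorithm. How the phrase counts are turned into a final similarity score is not stated in the abstract, so the sketch stops at the counts.

```python
import xml.etree.ElementTree as ET


def linearize(xml_text):
    """Return the document structure as a tuple of tag events; text content is ignored.

    Assumption: structure is represented as open/close tag events in document order.
    """
    events = []

    def walk(node):
        events.append("<" + node.tag + ">")
        for child in node:
            walk(child)
        events.append("</" + node.tag + ">")

    walk(ET.fromstring(xml_text))
    return tuple(events)


def lz78_phrases(seq):
    """Count the phrases of an LZ78-style incremental parsing of seq.

    Fewer phrases per symbol indicates a more regular (lower-entropy) sequence.
    """
    seen = set()
    phrase = ()
    count = 0
    for sym in seq:
        phrase += (sym,)
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ()
    # A trailing phrase that was already in the dictionary still counts once.
    return count + (1 if phrase else 0)


def cross_parse_phrases(z, x):
    """Count the phrases when z is cut into longest pieces that occur in x.

    Ziv-Merhav-style cross-parsing, done naively: a symbol that never occurs
    in x forms a phrase of length one. z and x must be the same sequence type.
    """

    def occurs(sub):
        n = len(sub)
        return any(x[i:i + n] == sub for i in range(len(x) - n + 1))

    count, i = 0, 0
    while i < len(z):
        length = 1
        while i + length <= len(z) and occurs(z[i:i + length]):
            length += 1
        i += max(length - 1, 1)  # longest match, or one unmatched symbol
        count += 1
    return count


# Hypothetical example documents: two library-like structures and one HTML-like one.
d1 = "<lib><book><title/><author/></book><book><title/><author/></book></lib>"
d2 = "<lib><book><title/><author/></book></lib>"
d3 = "<html><body><p/><p/><div><p/></div></body></html>"
s1, s2, s3 = linearize(d1), linearize(d2), linearize(d3)

print(cross_parse_phrases(s2, s1))  # few phrases: s2 is largely contained in s1
print(cross_parse_phrases(s2, s3))  # many phrases: no shared tags
```

A small cross-parsing count relative to the sequence length suggests that one document's structure is well described by the other's, which is the intuition behind using it as a similarity signal.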
Related resources
Information-Theoretic Approaches for Measuring the Structural Similarity of Semistructured Documents
We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first we extract and linearize the structural information from documents and then we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entr...
XML Document Clustering Using Structure-Preserving Flat Representation of XML Content and Structure
With the increasing use of XML in many domains, XML document clustering has been a central research topic in semistructured data management and mining. Due to the semistructured nature of XML data, the clustering problem becomes particularly challenging, mainly because structural similarity measures specifically designed to deal with tree/graph-shaped data can be quite expensive. Specialized cl...
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
Due to the increasing number of XML documents, effectively organizing these documents in order to retrieve useful information from them is essential. A possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
A Novel Approach to Measuring Structural Similarity between XML Documents
Measuring structural similarity between XML documents has become a key component in various applications, including XML mining, schema matching, and web service discovery, among others. This paper presents a novel structural similarity measure incorporating kernel methods into XML documents. Results on preliminary simulations show that this approach outperforms conventional ones.
SYDE 676 Project Report – Fall 2002 Web Document Clustering Using Phrase-based Document Similarity
Measuring the similarity between documents is an essential operation in text mining, especially document clustering. The traditional method of finding the similarity between documents has always been based on extracting individual words from the documents, and using heuristics to give weights to those features. Standard methods in data mining are then used to find the similarity between documen...